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Abstract 

We introduce a model for the evolution of species triggered by gener- 
ation of novel features and exhaustive combination with other available 
traits. Under the assumption that innovations are rare, we obtain a bursty 
branching process of speciations. Analysis of the trees representing the 
branching history reveals structures qualitatively different from those of 
random processes. For a tree with n leaves generated by the introduced 
model, the average distance of leaves from root scales as (logn)^ to be 
compared to log n for random branching. The mean values and standard 
deviations for the tree shape indices depth (Sackin index) and imbalance 
(Colless index) of the model are compatible with those of real phyloge- 
netic trees from databases. Earlier models, such as the Aldous' branching 
(AB) model, show a larger deviation from data with respect to the shape 
indices. 

1 Introduction 

Since the seminal work by Darwin [11] , the evolution of biological species has 
been recognized as a complex dynamics involving broad distributions of tempo- 
ral and spatial scales as well as stochastic effects, giving rise to so-called frozen 
accidents. There is vast exchange and overlap of concepts and methods between 
the theory of evolution and the foundations of complex systems such as fitness 
landscapes [35l [121 [20] and neutral networks [19 , the evolution of cooperation 
[3] and self-organized criticality [4 to name but a few. 

A striking feature of biological macroevolution is its burstiness. The tem- 
poral distribution of speciation and extinction events is highly inhomogeneous 
in time [28]. As described by the theory of punctuated equilibrium [13], a 
connection between punctuated equilibrium in evolution and the theory of self- 
organized criticality [4 is established through the model by Bak and Sneppen 
[5l[29]. Ecology, i.e. the system of trophic interactions and other dependencies 



between species' fitnesses, is driven to a critical state. Then minimal perturba- 
tions cause relaxation cascades of broadly distributed sizes. 

Rather than through ecological interaction across possibly all species, bursty 
diversification may also be due to adaptive radiation as a rapid multiplication 
of species in one lineage after a triggering event. About 200 million years ago, 
a novel chewing system with dedicated molar teeth evolved in the lineage of 
mammals, allowing it to rapidly diversify into species using vastly distinct types 
of nutrition [33 . There are many more examples where a single innovation 
triggers adaptive radiation such as the tetrapod limb morphology caused by a 
binary shift in bone arrangement ^2 homeothermy as a key innovation 

by the group of mammals ^14l [21] . Environmental conditions a species has not 
encountered previously, e.g. when entering a geographical area with unoccupied 
ecological niches, may also be the source of adaptive radiation. The diversity 
of finch species on Galapagos islands is the famous example first studied by 
Darwin. Spontaneous phenotypic or genetic innovations and those caused by 
the pressure to adapt to a change in environment are treated on the same footing 
for the modeling purposes in this contribution. Though being a central concept 
in the theory of evolution, the term innovation has not been ascribed a unique 
definition so far [23 . 

Here we study a branching process to mimic the evolution of species driven 
by innovations. The process involves a separation of time scales. Rare inno- 
vation events trigger rapid cascades of diversification where a feature combines 
with previously existing features. We call this newly defined branching process 
innovation model. 

How can the validity of models of this kind be assessed? The evolutionary 
history of species is captured by phylogenetic trees. These are binary trees 
where leaves represent extant species, alive today, and inner nodes stand for 
ancestral species from which the extant species have descended. By comparing 
the shapes of these trees [26\ [171 (HI [S] , in particular their degree of imbalance 
[H [22] , with trees generated by different evolutionary mechanisms [2j [6l [15] , a 
selection of realistic models is possible. 

2 Stochastic models of macroevolution 

We consider models of macroevolution within the following formal framework. 
At each point in time t, there is a set of species S{t). Evolution proceeds as 
follows. A species s G S{t) is chosen according to a probability distribution 
7r(5,t) on S{t). Speciation of s means replacing s by two new species s' and s" 
such that 

Sit + l) = Sit)\{s}U{s',s"} (1) 

is the set of species at time t -\- 1. The initial condition (at t = 1) is a single 
species. Therefore discrete time t and number of species n are identical, n = 
\S{t)\=t. 
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Figure 1: Comparison of tree shapes. Each tree of size eight consists of a root 
(white diamond), a set of inner nodes (black squares) and a set of leaves (gray 
circles). The left tree is totally imbalanced, also called comb tree, with depth 
= 35/8 = 4.375 and Colless index c = 21/21 = 1 . The right tree is a complete 
binary tree with depth d = 24/8 = 3 and Colless index c = 0/21 = 0. 

2.1 Trees 

The evolutionary history of organisms is represented by a phylogenetic tree. For 
the purpose of this contribution, a phyologenetic tree is a rooted strict binary 
tree T: a tree with exactly one node (the root) with degree two or zero, all 
other nodes having degree three (inner node) or one (leaf node) , cf. illustrations 
in Figure [T] For such a tree T with root a subtree T' is obtained as the 
component not containing w after cutting an edge {i^j} of T. T' is again a 
rooted strict binary tree. Since this contribution focuses on tree shape, all 
edges have unit length. The distance between nodes i and j on a tree T is the 
number of edges contained in the unique path between i and j. 

From the evolutionary dynamics, an evolving phylogenetic tree T{t) is ob- 
tained as follows. At each time step t, the leaves of T{t) are the species S{t). 
When s undergoes specation, two new leaves s' and s" attach to a leaf s. After 
this event, s is an inner node and no longer a leaf of the tree. In this way, each 
model of speciation dynamics also defines a model for the growth of a binary 
tree by iterative splitting of leaves. 

2.2 Yule model 

In the simplest case, the probability of choosing a species is uniform at each 
time step, 7r(s,t) = l/t. This is the Yule model or ERM model. It serves as a 
null model of evolution. 

The model corresponds to a particularly simple probability distribution on 
the set of generated trees. For a tree with n > 2 leaves generated by the Yule 
model and iG{l,2...,n — 1}, let Perm(^I^) be the probability that exactly i 
leaves are in the left subtree of the root. Then Perm(^I^) = 1/(^ ~ !)• This is 
shown inductively as follows. Obtaining exactly i leaves at step n, either they 
were already present at the previous step and the speciation took place in the 
right subtree, or the number increased from i — 1 to z by speciation in the left 
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subtree. Addition of these products of probabilities for the two cases yields 

Ti — 1 — i i — 1 

PERM{i\n) = z—pERM{i\n - 1) H -PERuii - Mn - 1) . (2) 

n — 1 n — 1 

With the induction hypothesis PermOI^ ~ 1) = 1/(^ ~ 2) for all j, we obtain 

. (n-l-i) + (i-l) 1 

Perm ^ n = — — = . 3 

[n — l){n — 2) n — 1 

The induction starts with Perm(1|2) = 1 which holds because a tree with two 
leaves has one leaf each in the left and in the right subtree. Thus the uniform 
selection of species turns into a uniform distribution on the number of nodes in 
the left or right subtree. Note that the same distribution applies to each subtree 
of an ERM tree. Therefore Perm fully describes the statistical ensemble of ERM 
trees. The probability of obtaining a particular tree is the product of Perm terms 
taken over all subtrees. This becomes particularly relevant for modifications of 
the model taking p non-uniform, see the following subsection. 

2.3 Aldous' branching (AB) model 

The class of beta-splitting models defines a distribution of trees by the proba- 
bility 

^ 1 r(/3 + / + i)r(/3 + n-/ + i) , ^ 

^^(^1^) = M^) r(/ + i)r(n-/ + i) 

with appropriate normalization factor a/3(n). Analogous to Perm of the previous 
subsection, P(3{i\n) is the probability that a tree has i out of its n leaves in the 
left subtree. In order to build a tree with n leaves, one first decides according 
to P/3(i|n) to have i leaves in the left and n — i leaves in the right subtree. Then 
the same rule is applied to both subtrees with the determined number of leaves. 
The recursion into deeper subtrees naturally stops when a subtree is decided to 
have one leaf. 

The parameter P G [—2; +oo[ in Equation Q tunes the expected imbalance. 
By increasing /3, equitable splits with i ^ n/2 become more probable. The 
probability distribution of trees from the Yule model is recovered by taking /3 = 
0. The case P = —1.5 is called Proportional to Distinguishable Arrangements 
(PDA). It produces a uniform distribution of all ordered (left-right labeled) trees 
of a given size n [25] El [30l [10] . 

Another interesting case is Aldous' branching (AB) model [1^^ obtained for 
/3 = — 1, where Equation [4] reads 

p-i{i\n) (X ^ . (5) 
i[n — i) 

Blum and Francois have found that /3 = — 1.0 is the maximum-likelihood choice 
of j3 over a large set of phylogenetic trees [6 . Therefore we use it as a standard 
of comparison. The AB model does not have an interpretation in terms of 
macroevolution, as noted by Blum and Francois |6]. In particular, it is unknown 
if its probability distribution of trees can be obtained by stochastic processes of 
iterated speciation as introduced at the beginning of this section. 
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2.4 Activity model 



In the activity model [15], the set of species S{t) is partitioned into a set of active 
species SA{t) and a set of inactive species Si{t). At each time step, a species 
s G SA{t) is drawn uniformly if SA{t) is non-empty. Otherwise s G Si{t) is 
drawn uniformly. The two new species s' and s" independently enter the active 
set SA{t^l) with probability p. The activation probability p is a parameter of 
the model. For p = 0.5 a critical branching process is obtained. Otherwise the 
model is similar to the Yule model. A variation of the activity model has been 
introduced by Herrada et al. [16^ in the context of protein family trees. 

2.5 Age-dependent speciation 

In the age model [18], the probability of speciation is inversely proportional to 
the age of a species. At each time, a species s G S{t) is drawn with probability 

ns{t) (X Ts{t)-^ (6) 

normalized properly. The age is the number of time steps passed since cre- 
ation of species s. 



2.6 Innovation model 



Algorithm 1: Pseudocode for the innovation model 



set t = 1, F(0) = 0, S{0) = {0}; 

while \S{t)\ < TV do // as final size of simulated tree 

if S{t) \{s\{(l)}:se S{t), (t) e F{t)} + then 
// loss event 
draw (j) G F{t) uniformly; 
draw s G S{t) uniformly; 
if s \ {0} ^ S{t) then 

^(t + i) = ^(t)u{AW}; 

F{t^l)=F{t); 
increment t; 

else 

// innovation event 

draw s G S{t) uniformly; 
set = 1 + max(F(t) U {0}); 
set S{t^l) = U{sU{0}}; 
set F{t + 1) = F{t) U {(^}; 
increment t: 



In the innovation models each species s is defined as a finite set of features 
5 C N. Features are taken as integer numbers in order to have an infinite supply 
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Figure 2: A tree of five leaves generated by the innovation model. The root node 
labeled with the empty feature set speciates by an innovation event adding the 
feature 1 to the feature set. This results in the species and {1} . Innovation 
events are performed, generating features until a loss event is possible. The loss 
event generates the species {3} by removing the feature 1 from {1,3}. 

of symbols. We denote by F{t) the set of all features existing at time t, that is 
F{t) = Usg5(^) ^' Each speciation occurs as one of two possible events. 

An innovation is the addition of a new feature (j) e N\F{t) not yet contained 
in any species at the given time t. One of the resulting species carries the new 
feature, s' = s U {(j)}. The other species has the same features as the ancestral 
one, s^^ = s. 

A loss event generates a new species by the disappearance of a feature. 
A feature (f) is drawn from F(t) uniformly. The loss event is performed only 
if s \ {^} ^ S{t) such that elimination of (j) from s actually generates a new 
species. In this case, the resulting species are the one having suffered the loss, 
s' = s \ {(j)} and the species s" = s remaining unaltered. Otherwise, <p is not 
present in s or its loss would lead to another already existing species, so nothing 
happens. 

We assume that creation of novel features is significantly less abundant than 
speciation by losses. This separation of time scales is implemented by the rule 
that an innovation event is only possible when no more losses can be performed. 
In order to facilitate further studies with the model, we provide a pseudocode 
description in Algorithm [T] Figure [2] shows an example of the dynamics. 

3 Comparison of simulated and empirical data 
sets 

Now let us compare the tree shapes obtained by the models with those of evolu- 
tionary trees in databases. The TreeBASE [27] database contains phylogenetic 
information about the evolution of species whereas the database PANDIT [3^ 
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(a) 



(b) 



(c) 



Figure 3: Empirical and siraulated trees. The depicted phylogenetic tree in 
(a) is from the database TreeBASE (Matrix ID M2957, relationships in rosids 
based on mitochondrial matR sequences), (b) is a tree created as a realization 
of the innovation model and (c) a tree from the ERM (Yule) model. Each of 
the trees has 161 leaves. 
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contains phylogenetic trees representing protein domains. Analysing the prop- 
erties with reference to the tree shape of both data sets and applying a compar- 
ative study with statistical data sets of different models one can conclude how 
well a growth model constructs "real" trees. 

Comparison by simple inspection of trees from real data and models may 
already reveal substantial shape differences. Figure [3] shows an example. The 
trees in panels (a) and (b) are less compact than that of panel (c) of Figure [sj 

For an objective and quantitative comparison of trees, we use the following 
two measures of tree shape. Compactness is best described by the distance di 
of a leaf i from root being small. The depth (or Sackin index) [26 is the average 
distance of leaves from root, 

d = 2^ . (7) 

n 

The Colless index measures the average imbalance of a tree [9 . The imbal- 
ance at an inner node j of the tree is the absolute difference Cj = \lj — Vj \ of 
leaves in the left and right subtree rooted at j. Then the average of imbalances 

'^=(n-l)(n-2)g^^- 

with appropriate normalization is the Colless index c of the tree. The index j 
runs over all n — 1 inner nodes including the root itself. We find c = for a 
totally balanced tree and c = 1 for a comb tree, see also Figure [l] 

Ensemble mean values and standard deviations of these indices are shown in 
Figure |4j Comparing the results of three models (ERM, AB and innovation) to 
those of trees from two databases, the least discrepancy is obtained between the 
innovation model and the trees from TreeBASE, representing macroevolution. 
In Figure m the averages of the two indices are shown after rescaling to facilitate 
the comparison. Of all models, the values of the innovation model are also best 
matching those of PANDIT. 



4 Depth scaling in the innovation model 

4.1 Subtree generated by an innovation 

Suppose the i-th innovation, generating feature z, affects a species s with / 
features. Then s is removed from the set S of extant species, turning into an 
inner node in the tree. Two new species and s^^ are attached, having feature 
sets s' = s and s" = {i} U s. In subsequent loss events, a subtree is built up 
with 2^ leaves, each of which is a species a C s U {i}. Call D(Ti) sum of the 
distances of all the leaves in from the root of T^. 

Let us now estimate the expectation value {D{Ti)), which only depends on 
the number of features of /. Trivially, D{Ti) is lower bounded by f2^ since the 
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Figure 4: Comparison of size-dependent summary statistics for models and 
real trees. Symbols distinguish the ERM model (o), the AB model (□) and the 
innovation model (o) and the data sets TreeBASE (*) and PANDIT (+). The 
data sets were preprocessed by solving monotomies and polytomies randomly 
as well as removing the outgroups as proposed by [6 . The mean values of 
depth, and Colless index, panels (a) and (b) are binned logarithmically as a 
function of tree size n. The same procedure is applied to the standard deviations, 
panels (c) and (d). The analysed TreeBASE data set has been downloaded 
from http://www.treebase.orgl on June, 2007 containing 5,087 trees of size 5 
to 535 after preprocessing. The PANDIT data set has been downloaded from 
'http: / / www.ebi.ac.uk/goldman-srv / pandit on May 2008 and includes 36,136 
preprocessed trees of size 5 to 2,562 . The simulated data set comprises for each 
model (AB model, ERM model and innovation model) 1,000 trees for each tree 
size from 5 to 535 and 10 trees for each tree size from 536 to 2,562. 
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Figure 5: The same values of depth and Cohess index as in Figure [4] (a,b) 
with an n-dependent rescahng. (a) Average depth divided by Inn. (b) Average 
Cohess index divided by In n. These factors are chosen such that the rescaled 
values for the ERM model asymptotically approach a constant. See refererence 
[7] for the scaling of the indices of the ERM model. 



most compact tree is the fully balanced one with all nodes at distance / from 
root. In particular, we conjecture 

f2f < {D{T,)) < DERM{2f) . (9) 

The second inequality is corroborated by the plots in Figure |6] We make it 
plausible as follows. Similar to the ERM model, a leaf is chosen in each time 
step when executing loss events. Here, however, the loss event is performed 
only if the chosen leaf carries the chosen feature and the reduced feature set is 
not yet present in the tree. Thus the probability of accepting a proposed loss 
event at a leaf s is anticorrelated with the number of features \s\ at s. The 
expected number of features carried by a leaf decreases with its distance from 
root. Therefore we argue that the present model adds new nodes preferentially 
to leaves closer to root than average, resulting in trees with an expected depth 
increasing more slowly than in the ERM model. 

4.2 Approximation of depth scaling 

We study a tree growth that is derived from the innovation model by two simpli- 
fying assumptions, (i) Each innovation is introduced at the leaf with the largest 
number of features in the tree, (ii) Introducing an innovation at a leaf with / 
features triggers the growth of a subtree that is a perfect (complete) binary tree 
with 2^ leaves at distance / from the root of this subtree. 

This leads us to consider the following deterministic growth starting with a 
single node and i = 0. Choose a leaf s at maximum distance from root; split 
s obtaining new leaves and s^^; take s^^ as the root of a newly added subtree 
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Figure 6: Average depth in dependence of the number of leaves n in trees 
generated with stochastic loss events (dots with error bars). Each data point is 
an average over 100 realizations with error bars indicating standard deviations. 
For comparison, the expected depth for the ERM model (□) and for complete 
binary trees (A) are shown. 




Figure 7: The deterministic growth of a tree considered as an approximation 
of the innovation model. Each subtree generated by an innovation is indicated 
as a shaded area. 
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that is a perfect tree with 2* leaves; increase i by one and iterate. Figure [t] 
ihustrates the first few steps of the growth. 

After i steps, the number of leaves added to the tree most recently is 2*~^. 
Therefore, the total number of leaves after step i is 

i 

n(i) = l + ^2^-^ = 2^ . (10) 

because the procedure starts with a single leaf at z = 0. 

The leaves of the subtree added by the j-th innovation have distance 



= (11) 

k=l 



from root because these leaves are j levels deeper than those generated by the 
previous innovation. Therefore the sum of all leaves' distances from root is 

n 

I)(z) = z + ^2^-Mj(J + 1)/2] (12) 

after the i-th innovation has been performed. The first term i arises because 
the innovation itself renders one previously existing leaf at a distance increased 
by one, cf. the leaves outside the shaded areas in Figure [7| In performing the 
sum of Equation [12] we use the equality 

i 

E^'"Mj(j + l)]=2i^'-^ + 2]-2 (13) 

to arrive at 

D{i) = i + - i + 2] - 1 . (14) 

We substitute n{i) = 2*, i.e. i = log2 n, and divide D hy n to arrive at the depth 

d{n) = \ [(log2 n)2 - (log2 n) + 2] + ~ ^ (15) 

of the tree with n leaves generated by deterministic growth. For large n, the 
depth scaling is 

d{n) - (logn)^ . (16) 

By the comparison in Fig. [sj we find the (logn)^ scaling also for the depth 
of trees obtained from the innovation model as defined in Section 12.61 Thus 
we hypothesize that the deterministic growth captures the essential mechanism 
leading to the depth scaling of the innovation model. The prefactor of (logn)^ is 
smaller in the innovation model than in the deterministic growth. In the actual 
model, most innovations hit a leaf with a non-maximal number of features and 
therefore trigger the growth of a lower subtree than assumed by deterministic 
growth. Table [l] provides an overview of the scaling of average depth with the 
number of leaves for various tree models . 
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Table 1: Depth scaling of models. 



innovation model 


(logn)^ 








\ogn if P > - 


-1, includes ERM (/3 = 0) 


/3-splitting pj 


< 


(logn)2 if/3 = - 
n-^-^ ifp<- 


1, AB model 

-1, includes PDA {(3 = -1.5) 


age model |18| 


(log n)^ 




activity model |15| 


1 


'^0.5 ifp = o.5, 
log n otherwise. 




complete tree 


log n 




comb tree 


n 






number of leaves n 



Figure 8: Depth as a function of tree size n for the innovation model (o) and for 
the deterministic growth (solid curve) according to Equation (15). Note that 
square root of depth is plotted such that a straight line in the plot indicates 
a depth scaling d{n) ^ (logn)^. For each size n, the plotted point (o) is the 
average over \/ d(n) for 100 independently generated trees. Error bars give the 
standard deviation. 
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5 Discussion 



The innovation model establishes a connection between the burstiness of macroevo- 
lution and the observed imbalance of phylogenetic trees. Bursts of diversification 
are triggered by generation of new features and combination with the repertoire 
of existing traits. In order to keep the model simple, the diversification after 
an innovation is implemented as a sequence of random losses of features. More 
realistic versions of the model could be studied where combinations of traits are 
enriched by re-activation of previously silenced traits or horizontal transfer be- 
tween species. Furthermore, the model as presented here neglects the extinction 
of species and their infiuence on the shapes of phylogenetic trees. 

Regarding the robustness of the model, the depth scaling would have to 
be tested under modifications. In particular, the infinite time scale separation 
between rare innonvations and frequent loss events could be given up by allowing 
innovations to occur at a finite rate set as a parameter. 

In summary, we have defined a well- working, biologically motivated model 
which nevertheless is suflaciently simple to allow for further enhancement re- 
garding biological concepts such as sequence evolution and genotype-phenotype 
relations. 
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